Okay, let's get started.

Alright, so welcome to lecture five. Today we're going to be getting to the title of the class, Convolutional Neural Networks.

Okay, so a couple of administrative details before we get started. Assignment one is due Thursday, April 20, 11:59 p.m. on Canvas. We're also going to be releasing assignment two on Thursday.

Okay, so a quick review of last time. We talked about neural networks, and how we had the running example of the linear score function that we talked about through the first few lectures. And then we turned this into a neural network by stacking these linear layers on top of each other with non-linearities in between. We also saw that this could help address the mode problem, where a single class can have multiple modes: we're able to learn intermediate templates that are looking for, for example, different types of cars, a red car versus a yellow car and so on, and then combine these together to come up with the final score function for a class.

Okay, so today we're going to talk about convolutional neural networks, which is basically the same sort of idea, but now we're going to learn convolutional layers that explicitly try to maintain spatial structure.

So let's first talk a little bit about the history of neural networks, and then also how convolutional neural networks were developed. We can go all the way back to 1957 with Frank Rosenblatt, who developed the Mark I Perceptron machine, which was the first implementation of an algorithm called the perceptron. It had a similar idea of getting score functions, using W times x plus a bias, but here the outputs are going to be either a one or a zero.
And then in this case we have an update rule for our weights, W, which looks kind of similar to the type of update rule that we're also seeing in backprop. But in this case there was no principled backpropagation technique yet; we just sort of took the weights and adjusted them in the direction towards the target that we wanted.

Then in 1960 we had Widrow and Hoff, who developed Adaline and Madaline, which was the first time that we were able to start stacking these linear layers into multilayer perceptron networks. And so this is starting to look kind of like this idea of neural network layers, but we still didn't have backprop or any sort of principled way to train this.

The first time backprop was really introduced was in 1986 with Rumelhart. Here we can start seeing these kinds of equations with the chain rule and the update rules that we're starting to get familiar with, and so this is the first time we started to have a principled way to train these kinds of network architectures.

After that, it still wasn't able to scale to very large neural networks, and so there was a period in which there wasn't a whole lot of new things happening here, or a lot of popular use of these kinds of networks. This really started being reinvigorated around the 2000s. In 2006 there was a paper by Geoff Hinton and Ruslan Salakhutdinov, which basically showed that we could train a deep neural network, and showed that we could do this effectively. But it was still not quite the modern iteration of neural networks. It required really careful initialization in order to be able to do backprop, and so what they had was a first pre-training stage, where you model each hidden layer through a restricted Boltzmann machine, and so you're going to get some initialized weights by training each of these layers iteratively.
And so once you get all of these hidden layers, you use that to initialize your full neural network, and then from there you do backprop and fine-tuning of that.

We really started to get the first really strong results using neural networks, and what really sparked the whole craze of starting to use these kinds of networks widely, around 2012. First we had the strongest results for speech recognition; this is work out of Geoff Hinton's lab for acoustic modeling and speech recognition. And then for image recognition, 2012 was the landmark paper from Alex Krizhevsky in Geoff Hinton's lab, which introduced the first convolutional neural network architecture that was able to get really strong results on ImageNet classification. It took the ImageNet image classification benchmark and was able to dramatically reduce the error on that benchmark. And since then, ConvNets have gotten really widely used in all kinds of applications.

So now let's step back and take a look at what gave rise to convolutional neural networks specifically. We can go back to the 1950s, where Hubel and Wiesel did a series of experiments trying to understand how neurons in the visual cortex worked, and they studied this specifically for cats. We talked a little bit about this in lecture one, but basically in these experiments they put electrodes into the cat brain, and they gave the cat different visual stimuli: things like different kinds of edges, oriented edges, different sorts of shapes, and they measured the response of the neurons to these stimuli.

And so there were a couple of important conclusions and observations that they were able to make. The first thing they found was that there's sort of this topographical mapping in the cortex. So nearby cells in the cortex also represent nearby regions in the visual field.
You can see this, for example, on the right here, where if you take the spatial mapping of the visual field and map it onto the visual cortex, the more peripheral regions are these blue areas, farther away from the center.

They also discovered that these neurons had a hierarchical organization. Looking at different types of visual stimuli, they found that at the earliest layers, retinal ganglion cells were responsive to things that looked kind of like circular spots. On top of that there are simple cells, and these simple cells are responsive to oriented edges, so different orientations of the light stimulus. Going further, they discovered that these were then connected to more complex cells, which were responsive to both light orientation as well as movement, and so on. And you get increasing complexity; for example, hypercomplex cells are now responsive to movement with kind of an endpoint, and so now you're starting to get the idea of corners, and then blobs, and so on.

And so then in 1980, the neocognitron was the first example of a network architecture, a model, that had this idea of simple and complex cells that Hubel and Wiesel had discovered. In this case Fukushima put these into alternating layers of simple and complex cells, where you had simple cells that had modifiable parameters, and then complex cells on top of these that performed a sort of pooling, so that it was invariant to minor modifications from the simple cells.

So this is work that was in the 1980s, and then by 1998 Yann LeCun basically showed the first example of applying backpropagation and gradient-based learning to train convolutional neural networks that did really well on document recognition. Specifically, they were able to do a good job of recognizing the digits of zip codes, and so these were then used pretty widely for zip code recognition in the postal service.
But beyond that it wasn't able to scale yet to more challenging and complex data; digits are still fairly simple and a limited set to recognize. And so this is where Alex Krizhevsky, in 2012, gave the modern incarnation of convolutional neural networks, the network we colloquially call AlexNet. This network really didn't look so much different from the convolutional neural networks that Yann LeCun was dealing with. They were now scaled to be larger and deeper, and the most important parts were that they were now able to take advantage of the large amount of data that had become available, in web images, in the ImageNet dataset, as well as take advantage of the parallel computing power in GPUs. And we'll talk more about that later.

But fast forwarding to today, ConvNets are now used everywhere. So we have the initial classification results on ImageNet from Alex Krizhevsky. These networks are also able to do a really good job of image retrieval; you can see that when we're trying to retrieve a flower, for example, the features that are learned are really powerful for doing similarity matching.

We also have ConvNets that are used for detection, so we're able to do a really good job of localizing where in an image is, for example, a bus or a boat, and so on, and draw precise bounding boxes around that. We're able to go even deeper beyond that to do segmentation; these are now richer tasks where we're not looking for just the bounding box, but we're actually going to label every pixel in the outline of trees, and people, and so on.

And these kinds of algorithms are used in, for example, self-driving cars. A lot of this is powered by GPUs, as I mentioned earlier, which are able to do parallel processing and efficiently train and run these ConvNets. We have modern powerful GPUs, as well as ones that work in embedded systems, for example, that you would use in a self-driving car.
So we can also look at some of the other applications that ConvNets are used for. Face recognition: we can put in an input image of a face and get out a likelihood of who this person is. ConvNets are applied to video, and so this is an example of a video network that looks at both images as well as temporal information, and from there is able to classify videos. We're also able to do pose recognition, being able to recognize shoulders, elbows, and different joints. And so here are some images of our fabulous TA, Lane, in various kinds of pretty non-standard human poses. But ConvNets are able to do a pretty good job of pose recognition these days.

They're also used in game playing. Some of the work in reinforcement learning, deep reinforcement learning that you may have seen, playing Atari games, and Go, and so on; ConvNets are an important part of all of these. Some other applications: they're being used for interpretation and diagnosis of medical images, for classification of galaxies, for street sign recognition. There's also whale recognition; this is from a recent Kaggle challenge. We also have examples of looking at aerial maps and being able to draw out where the streets are on these maps, where the buildings are, and being able to segment all of these.

And then beyond recognition tasks like classification and detection, we also have tasks like image captioning, where given an image, we want to write a sentence description of what's in the image. This is something that we'll go into a little bit later in the class.

And we also have really fancy and cool kinds of artwork that we can do using neural networks. On the left is an example of DeepDream, where we're able to take images and kind of hallucinate different kinds of objects and concepts in the image. There's also neural style type work, where we take an image and we're able to re-render this image using the style of a particular artist and artwork.
And so here we can take, for example, Van Gogh's Starry Night on the right, and use that to redraw our original image using that style. Justin has done a lot of work in this, so if you guys are interested, these are images produced by some of his code, and you should talk to him more about it.

Okay, so basically this is just a small sample of where ConvNets are being used today. But there's really a huge amount that can be done with this, and so for your projects, let your imagination go wild; we're excited to see what sorts of applications you can come up with.

So today we're going to talk about how convolutional neural networks work. And again, same as with neural networks, we're going to first talk about how they work from a functional perspective, without any of the brain analogies, and then we'll talk briefly about some of these connections.

Okay, so last lecture we talked about this idea of a fully connected layer. For a fully connected layer, what we're doing is we operate on top of these vectors. So let's say we have an image, a 3D volume, 32 by 32 by three, like some of the images that we were looking at previously. We'll take that and stretch all of the pixels out, and then we have this 3072-dimensional vector. And then we have these weights, so we're going to multiply this by a weight matrix; here, for example, our W is going to be 10 by 3072. And then we're going to get the activations, the output of this layer. So in this case, we take each of our 10 rows and we do a dot product with the 3072-dimensional input, and from there we get one number that's kind of the value of that neuron. And so in this case we're going to have 10 of these neuron outputs.
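Just to make that concrete, here's a minimal numpy sketch of that fully connected computation; the shapes are the ones from the slide, and the random values are just stand-ins for real pixels and learned weights:

```python
import numpy as np

# A 32x32x3 input image, stretched out into a 3072-dimensional vector.
x = np.random.randn(32, 32, 3).reshape(-1)   # shape (3072,)

# Weight matrix W (10 x 3072) and bias, randomly initialized stand-ins.
W = np.random.randn(10, 3072)
b = np.random.randn(10)

# Each of the 10 rows of W is dotted with the 3072-dim input,
# giving one number per neuron: 10 neuron outputs total.
scores = W.dot(x) + b                        # shape (10,)
print(scores.shape)                          # (10,)
```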
And so the main difference between a convolutional layer and the fully connected layer that we've been talking about is that here we want to preserve spatial structure. Taking this 32 by 32 by three image that we had earlier, instead of stretching this all out into one long vector, we're now going to keep the structure of this image, this three-dimensional input. And then what we're going to do is our weights are going to be these small filters, in this case for example a five by five by three filter, and we're going to take this filter and slide it over the image spatially, computing dot products at every spatial location. We're going to go into detail of exactly how this works.

So, our filters, first of all, always extend the full depth of the input volume. They're going to be just a smaller spatial area, in this case five by five instead of our full 32 by 32 spatial input, but they're always going to go through the full depth, so here we're going to take five by five by three. And then we're going to take this filter, and at a given spatial location we're going to do a dot product between this filter and a chunk of the image. So we're just going to overlay this filter on top of a spatial location in the image and then do the dot product: the multiplication of each element of that filter with each corresponding element in the spatial location that we've just plopped it on top of. And this is going to give us a dot product. So in this case we have five times five times three multiplications to do, plus the bias term. And so this is basically taking our filter W and doing W transpose times x, plus bias.

So is that clear how this works? Yeah, question.

[faint speaking]

Yeah, so the question is, when we do the dot product do we turn the five by five by three into one vector? Yeah, in essence that's what you're doing.
You can think of it as just plopping it on and doing the element-wise multiplication at each location, but this is going to give you the same result as if you stretched out the filter at that point, stretched out the input volume that it's laid over, and then took the dot product, and that's what's written here. Yeah, question.

[faint speaking]

Oh, so the question is, is there any intuition for why this is a W transpose? Not really; this is just the notation that we have here to make the math work out as a dot product. It just depends on how you're representing W; in this case, if we look at the W matrix, this happens to be each column, and so we're just taking the transpose to get a row out of it. But there's no intuition here; we're just taking the filters of W and stretching them out into a 1D vector, and in order for it to be a dot product it has to be a one by N vector.

[faint speaking]

Okay, so the question is, is W here not five by five by three, but one by 75? That's the case: if we're going to do this dot product of W transpose times x, we have to stretch it out first before we do the dot product. So we take the five by five by three, and we just take all these values and stretch them out into a long vector.

And so again, similar to the other question, the actual operation that we're doing here is plopping our filter on top of a spatial location in the image and multiplying all of the corresponding values together, but just to make it an easy expression similar to what we've seen before, we can also stretch each of these out, making sure that the dimensions are transposed correctly so that it works out as a dot product. Yeah, question.

[faint speaking]

Okay, the question is, how do we slide the filter over the image. We'll go into that next, yes.
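To see that equivalence concretely, here's a tiny numpy sketch, with random arrays standing in for the filter and the chunk of image it's laid over; plopping-and-summing and flattening-then-dotting give the same number:

```python
import numpy as np

filt  = np.random.randn(5, 5, 3)   # one 5x5x3 filter
patch = np.random.randn(5, 5, 3)   # the chunk of image it's plopped on top of
bias  = 0.1

# Option 1: element-wise multiply at each location, then sum.
v1 = np.sum(filt * patch) + bias

# Option 2: stretch both out into 75-dim vectors and take a dot product,
# i.e. the "W transpose times x plus bias" written on the slide.
v2 = filt.reshape(-1).dot(patch.reshape(-1)) + bias

print(np.isclose(v1, v2))          # True: same result either way
```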
[faint speaking]

Okay, so the question is, should we rotate the kernel by 180 degrees to better match the definition of a convolution. The answer is that we'll also show the equation for this later, but we're using convolution as kind of a looser definition of what's happening. For people from signal processing, what we are actually technically doing, if you want to call this a convolution, is convolving with the flipped version of the filter. But for the most part we just don't worry about this; we just do this operation, and it's a convolution in spirit.

Okay, so we had a question earlier: how do we slide this over all the spatial locations? What we're going to do is take this filter, start at the upper left-hand corner, and basically center our filter on top of every pixel in this input volume. At every position, we're going to do this dot product, and this will produce one value in our output activation map. And so then we're going to just slide this around. The simplest version is that at every pixel we do this operation and fill in the corresponding point in our output activation map.

You can see here that the dimensions don't work out exactly the same if you do this: I had 32 by 32 in the input and I'm getting 28 by 28 in the output. We'll go into examples later of the math of exactly how this works out dimension-wise, but basically you have a choice of how you're going to slide this: whether you go to every pixel, or whether you slide, let's say, two input values over at a time, two pixels over at a time. And so you can get different size outputs depending on how you choose to slide. But you're basically doing this operation in a grid fashion.
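Here's a rough sketch of that sliding in numpy, for one filter, stride one, and no padding; real implementations vectorize this, so this naive loop is just to illustrate where the 28 by 28 comes from:

```python
import numpy as np

image = np.random.randn(32, 32, 3)      # input volume (random stand-in)
filt  = np.random.randn(5, 5, 3)        # one 5x5x3 filter
bias  = 0.0

H, W, _ = image.shape
F = filt.shape[0]
out = np.zeros((H - F + 1, W - F + 1))  # 28 x 28 for these sizes

# Slide the filter over every valid spatial location and fill in
# one value of the output activation map per position.
for i in range(H - F + 1):
    for j in range(W - F + 1):
        out[i, j] = np.sum(filt * image[i:i+F, j:j+F, :]) + bias

print(out.shape)                         # (28, 28)
```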
Okay, so what we just saw is taking one filter and sliding it over all of the spatial locations in the image, and then we're going to get this activation map out, which is the value of that filter at every spatial location. When we're dealing with a convolutional layer, we want to work with multiple filters, because each filter is kind of looking for a specific type of template or concept in the input volume. So we're going to have a set of multiple filters, and here I'm going to take a second filter, this green filter, which is again five by five by three. I'm going to slide this over all of the spatial locations in my input volume, and then I'm going to get out this second, green activation map, also of the same size.

And we can do this for as many filters as we want to have in this layer. So for example, if we have six of these five by five filters, then we're going to get six activation maps out in total; we're going to get an output volume that's basically six by 28 by 28.

And so a preview of how we're going to use these convolutional layers in our convolutional network: our ConvNet is basically going to be a sequence of these convolutional layers stacked on top of each other, the same way as what we had with the simple linear layers in the neural network. And then we're going to intersperse these with activation functions, for example a ReLU activation function. So you're going to get something like Conv, ReLU, and usually also some pooling layers, and then you're just going to get a sequence of these, each creating an output that's now going to be the input to the next convolutional layer.

Okay, and so each of these layers, as I said earlier, has multiple filters, many filters, and each of the filters is producing an activation map.
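Extending the earlier sketch to a bank of filters: six five by five by three filters give six activation maps, stacked into a 28 by 28 by 6 output volume, and an element-wise ReLU would then be applied before the next conv layer. The conv_layer helper below is just made up for this sketch, again a naive loop rather than how you'd actually implement it:

```python
import numpy as np

def conv_layer(volume, filters, biases):
    """Naive convolution of an (H, W, C) volume with K filters of shape
    (F, F, C); returns an (H-F+1, W-F+1, K) output volume (stride 1, no pad)."""
    H, W, _ = volume.shape
    K, F = len(filters), filters[0].shape[0]
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                out[i, j, k] = np.sum(filters[k] * volume[i:i+F, j:j+F, :]) + biases[k]
    return out

image   = np.random.randn(32, 32, 3)
filters = [np.random.randn(5, 5, 3) for _ in range(6)]  # six 5x5x3 filters
biases  = np.zeros(6)

maps = conv_layer(image, filters, biases)
print(maps.shape)            # (28, 28, 6): six activation maps
relu = np.maximum(0, maps)   # Conv -> ReLU, ready for the next layer
```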
And so when you look at multiple of these layers stacked together in a ConvNet, what ends up happening is you end up learning this hierarchy of filters, where the filters at the earlier layers usually represent low-level features that you're looking for, things kind of like edges. At the mid-level, you're going to get more complex kinds of features, so maybe it's looking more for things like corners and blobs and so on. And then at the higher-level features, you're going to get things that are starting to resemble concepts more than blobs. We'll go into more detail later in the class on how you can actually visualize all these features and try to interpret what kinds of features your network is learning. But the important thing for now is just to understand that what these features end up being, when you have a whole stack of these, is these types of simple to more complex features.

[faint speaking]

Oh, okay, so the question is, what's the intuition for increasing the depth each time? So here I had three filters in the original layer and then six filters in the next layer. This is mostly a design choice; people in practice have found certain types of these configurations to work better. Later on we'll go into case studies of different kinds of convolutional neural network architectures and design choices for these, and why certain ones work better than others. But basically you're going to have many design choices in a convolutional neural network, the size of your filter, the stride, how many filters you have, and we'll talk about this all more later.

Question.

[faint speaking]

Yeah, so the question is, as we're sliding this filter over the image spatially, it looks like we're sampling the edges and corners less than the other locations.
Yeah, that's a really good point, and we'll talk in a few slides about how we try to compensate for that.

Okay, so with each of these convolutional layers stacked together, we saw how we're starting with simpler features and then aggregating these into more complex features later on. In practice this is consistent with what Hubel and Wiesel noticed in their experiments: that we had these simple cells at the earlier stages of processing, followed by more complex cells later on. And so even though we didn't explicitly force our ConvNet to learn these kinds of features, in practice, when you give it this type of hierarchical structure and train it using backpropagation, these are the kinds of filters that end up being learned.

[faint speaking]

Okay, so the question is, what are we seeing in these visualizations? In these visualizations, if we look at Conv1, the first convolutional layer, each part of this grid is one neuron. And what we've visualized here is what the input looks like that maximizes the activation of that particular neuron. So, what sort of image would give you the largest value, make that neuron fire and have the largest value. And the way we do this is basically by doing backpropagation from a particular neuron activation and seeing what in the input will give you the highest values of this neuron. This is something that we'll talk about in much more depth in a later lecture, about how we create all of these visualizations. But basically each element of these grids is showing what input would maximize the activation of the neuron; in a sense, what is the neuron looking for?

Okay, so here is an example of some of the activation maps produced by each filter.
Up on the top we can visualize a whole row of example five by five filters. This is basically a real case from a trained ConvNet, where each of these is what a five by five filter looks like, and below is what the activation map looks like as we convolve each one over an image, in this case I think a corner of a car, the car light. So here, for example, if we look at this first one, the filter with a red box around it, we'll see that the template it's looking for is an edge, an oriented edge. And so if you slide it over the image, it'll have a high value, a more white value, where there are edges in this type of orientation. So each of these activation maps is kind of the output of sliding one of these filters over the image, showing where this sort of template is more present in the image.

And the reason we call these convolutional is because this is related to the convolution of two signals. Someone pointed out earlier that this is basically this convolution equation over here, for people who have seen convolutions before in signal processing. In practice it's actually more like a correlation, where we're convolving with the flipped version of the filter, but this is kind of a subtlety; it's not really important for the purposes of this class. Basically, if you write out what we're doing, it has an expression that looks something like this, which is the standard definition of a convolution. But this is basically just taking a filter, sliding it spatially over the image, and computing the dot product at every location.

Okay, so as I had mentioned earlier, what our total convolutional neural network is going to look like is we're going to have an input image, and then we're going to pass it through this sequence of layers, where we're going to have a convolutional layer first. We usually have our non-linear layer after that.
601 00:30:28,236 --> 00:30:30,579 So ReLU is something that's very commonly used 602 00:30:30,579 --> 00:30:33,608 that we're going to talk about more later. 603 00:30:33,608 --> 00:30:36,791 And then we have these Conv, ReLU, Conv, ReLU layers, 604 00:30:36,791 --> 00:30:39,775 and then once in a while we'll use a pooling layer 605 00:30:39,775 --> 00:30:41,244 that we'll talk about later as well 606 00:30:41,244 --> 00:30:45,411 that basically downsamples the size of our activation maps. 607 00:30:47,300 --> 00:30:50,785 And then finally at the end of this we'll take our last 608 00:30:50,785 --> 00:30:54,403 convolutional layer output and then we're going to use 609 00:30:54,403 --> 00:30:56,872 a fully connected layer that we've seen before, 610 00:30:56,872 --> 00:31:00,316 connected to all of these convolutional outputs, 611 00:31:00,316 --> 00:31:03,011 and use that to get a final score function 612 00:31:03,011 --> 00:31:07,178 basically like what we've already been working with. 613 00:31:08,445 --> 00:31:10,931 Okay, so now let's work out some examples 614 00:31:10,931 --> 00:31:14,181 of how the spatial dimensions work out. 615 00:31:18,363 --> 00:31:23,087 So let's take our 32 by 32 by three image as before, 616 00:31:23,087 --> 00:31:25,624 right, and we have our five by five by three filter 617 00:31:25,624 --> 00:31:28,025 that we're going to slide over this image. 618 00:31:28,025 --> 00:31:29,816 And we're going to see how we're going to use that 619 00:31:29,816 --> 00:31:34,337 to produce exactly this 28 by 28 activation map. 620 00:31:34,337 --> 00:31:37,644 So let's assume that we actually have a seven by seven input 621 00:31:37,644 --> 00:31:39,104 just to be simpler, and let's assume 622 00:31:39,104 --> 00:31:41,505 we have a three by three filter. 623 00:31:41,505 --> 00:31:42,522 So what we're going to do is 624 00:31:42,522 --> 00:31:44,969 we're going to take this filter, 625 00:31:44,969 --> 00:31:47,418 plop it down in our upper left-hand corner, 626 00:31:47,418 --> 00:31:50,253 right, and we're going to multiply, do the dot product, 627 00:31:50,253 --> 00:31:53,169 multiply all these values together to get our first value, 628 00:31:53,169 --> 00:31:54,918 and this is going to go into the upper left-hand value 629 00:31:54,918 --> 00:31:56,764 of our activation map. 630 00:31:56,764 --> 00:31:58,217 Right, and then what we're going to do next 631 00:31:58,217 --> 00:32:00,475 is we're just going to take this filter, 632 00:32:00,475 --> 00:32:02,389 slide it one position to the right, 633 00:32:02,389 --> 00:32:05,535 and then we're going to get another value out from here. 634 00:32:05,535 --> 00:32:09,895 And so we can continue with this to have another value, 635 00:32:09,895 --> 00:32:12,797 another, and in the end what we're going to get 636 00:32:12,797 --> 00:32:14,528 is a five by five output, right, 637 00:32:14,528 --> 00:32:17,776 because what fit was basically sliding this filter 638 00:32:17,776 --> 00:32:22,214 a total of five spatial locations horizontally 639 00:32:22,214 --> 00:32:25,381 and five spatial locations vertically. 640 00:32:27,834 --> 00:32:29,414 Okay, so as I said before 641 00:32:29,414 --> 00:32:31,906 there's different kinds of design choices that we can make. 642 00:32:31,906 --> 00:32:34,710 Right, so previously I slid it at every single 643 00:32:34,710 --> 00:32:37,828 spatial location and the interval at which I slide 644 00:32:37,828 --> 00:32:40,326 I'm going to call the stride. 
So previously we used a stride of one. Now let's see what happens if we have a stride of two. We're going to take our first location the same as before, and then we're going to skip two pixels over this time and get our next value centered at this location. And so now, if we use a stride of two, we have in total three of these that can fit, and so we're going to get a three by three output.

Okay, and so what happens when we have a stride of three; what's the output size of this? In this case, we slide it over by three, and the problem is that here it actually doesn't fit. We slide it over by three and now it doesn't fit nicely within the image. And so in practice it just doesn't work; we don't do convolutions like this, because it's going to lead to asymmetric outputs.

And so, looking at the way that we computed what the output size is going to be, this actually works out into a nice formula, where we take the dimension of our input N, we have our filter size F, we have the stride at which we're sliding along, and our final output size, the spatial dimension of each output, is going to be (N - F) divided by the stride, plus one. You can kind of see this as: if I take my filter and fill it in at the very last possible position that it can be in, and then take all the pixels before that, how many instances of moving by this stride can I fit in? That's how this equation works out.

And so as we saw before, if we have N equals seven and F equals three, with a stride of one we plug it into this formula and we get five by five, as we had before, and the same thing for a stride of two. And with a stride of three, this doesn't really work out.

And so in practice it's actually common to zero pad the borders in order to make the size work out to what we want it to.
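That sizing arithmetic fits into a little helper; here's a sketch of the formula from the slide, with a pad term for the zero padding we're about to discuss (the function name is just made up for illustration), where a non-integer result corresponds to a stride that "doesn't fit":

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size: (N + 2*pad - F) / stride + 1.
    Raises if the filter doesn't tile the input evenly at this stride."""
    span = n + 2 * pad - f
    if span % stride != 0:
        raise ValueError("filter doesn't fit with this stride")
    return span // stride + 1

print(conv_output_size(7, 3, 1))          # 5, as in the stride-1 example
print(conv_output_size(7, 3, 2))          # 3, as in the stride-2 example
print(conv_output_size(7, 3, 1, pad=1))   # 7: padding preserves the size
# conv_output_size(7, 3, 3) would raise: a stride of 3 doesn't fit
```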
And so this is kind of related to a question earlier, which is what do we do at the corners. What happens in practice is we're going to actually pad our input image with zeros, and so now you're going to be able to place a filter centered at the corner pixel locations of your actual input image.

Okay, so here's a question: who can tell me, if I have my same input, seven by seven, a three by three filter, stride one, but now I pad with a one-pixel border, what's the size of my output going to be?

[faint speaking]

So, I heard some sixes, heard some sevens. Remember we have this formula from before. If we plug in N equals seven, F equals three, and our stride equals one, what we actually get is: seven minus three is four, divided by one is four, plus one is five. And so this is what we had before. So we actually need to adjust this formula a little bit; this formula is for the case where we don't have zero-padded pixels. But if we do pad, then if you take your new padded input and slide along it, you'll see that actually seven filter positions fit, so you get a seven by seven output. And plugging into our original formula: our N now is not seven, it's nine, so we have N equals nine, minus a filter size of three, which gives six; divided by our stride, which is one, is still six; and then plus one, we get seven. So once you've padded, you want to incorporate this padding into your formula.

Yes, question.

[faint speaking]

Seven, okay, so the question is, what's the actual size of the output, is it seven by seven, or seven by seven by three? The output is going to be seven by seven by the number of filters that you have. So remember, each filter is going to do a dot product through the entire depth of your input volume.
732 00:37:21,320 --> 00:37:23,801 But then that's going to produce one number, right, 733 00:37:23,801 --> 00:37:27,968 so each filter is, let's see if we can go back here. 734 00:37:29,540 --> 00:37:32,938 Each filter is producing a one by seven by seven 735 00:37:32,938 --> 00:37:37,124 in this case activation map output, and so the depth 736 00:37:37,124 --> 00:37:40,493 is going to be the number of filters that we have. 737 00:37:40,493 --> 00:37:43,243 [faint speaking] 738 00:37:50,161 --> 00:37:53,411 Sorry, let me just, one second go back. 739 00:37:55,136 --> 00:37:57,350 Okay, can you repeat your question again? 740 00:37:57,350 --> 00:38:00,267 [muffled speaking] 741 00:38:12,936 --> 00:38:16,011 Okay, so the question is, how does this connect to before 742 00:38:16,011 --> 00:38:19,735 when we had a 32 by 32 by three input, right. 743 00:38:19,735 --> 00:38:21,830 So our input had depth and here in this example 744 00:38:21,830 --> 00:38:24,721 I'm showing a 2D example with no depth. 745 00:38:24,721 --> 00:38:27,226 And so yeah, I'm showing this for simplicity 746 00:38:27,226 --> 00:38:30,373 but in practice you're going to take your, 747 00:38:30,373 --> 00:38:32,334 you're going to multiply throughout the entire depth 748 00:38:32,334 --> 00:38:34,188 as we had before, so you're going to, 749 00:38:34,188 --> 00:38:36,765 your filter is going to be in this case a three by three 750 00:38:36,765 --> 00:38:39,850 spatial filter by whatever input depth that you had. 751 00:38:39,850 --> 00:38:43,183 So three by three by three in this case. 752 00:38:44,059 --> 00:38:46,854 Yeah, everything else stays the same. 753 00:38:46,854 --> 00:38:48,390 Yes, question. 754 00:38:48,390 --> 00:38:51,307 [muffled speaking] 755 00:38:53,529 --> 00:38:55,731 Yeah, so the question is, does the zero padding 756 00:38:55,731 --> 00:38:58,664 add some sort of extraneous features at the corners? 757 00:38:58,664 --> 00:39:01,446 And yeah, so I mean, we're doing our best to still, 758 00:39:01,446 --> 00:39:03,779 get some value and do, like, 759 00:39:04,721 --> 00:39:06,289 process that region of the image, 760 00:39:06,289 --> 00:39:10,343 and so zero padding is kind of one way to do this, 761 00:39:10,343 --> 00:39:12,999 where I guess we can, we are detecting 762 00:39:12,999 --> 00:39:16,097 part of this template in this region. 763 00:39:16,097 --> 00:39:18,323 There's also other ways to do this that, you know, 764 00:39:18,323 --> 00:39:20,729 you can try and like, mirror the values here 765 00:39:20,729 --> 00:39:23,615 or extend them, and so it doesn't have to be zero padding, 766 00:39:23,615 --> 00:39:26,530 but in practice this is one thing that works reasonably. 767 00:39:26,530 --> 00:39:29,930 And so, yeah, so there is a little bit of kind of artifacts 768 00:39:29,930 --> 00:39:31,503 at the edge and we sort of just, 769 00:39:31,503 --> 00:39:33,834 you do your best to deal with it. 770 00:39:33,834 --> 00:39:36,486 And in practice this works reasonably. 771 00:39:36,486 --> 00:39:39,503 I think there was another question. 772 00:39:39,503 --> 00:39:41,283 Yeah, question. 773 00:39:41,283 --> 00:39:44,033 [faint speaking] 774 00:39:48,015 --> 00:39:51,535 So if we have non-square images, do we ever use a stride 775 00:39:51,535 --> 00:39:54,330 that's different horizontally and vertically?
776 00:39:54,330 --> 00:39:57,039 So, I mean, there's nothing stopping you from doing that, 777 00:39:57,039 --> 00:39:59,816 you could, but in practice we just usually 778 00:39:59,816 --> 00:40:02,841 take the same stride, we usually operate on square regions 779 00:40:02,841 --> 00:40:04,909 and we just, yeah we usually just 780 00:40:04,909 --> 00:40:08,238 take the same stride everywhere and it's sort of like, 781 00:40:08,238 --> 00:40:10,218 in a sense it's a little bit like, 782 00:40:10,218 --> 00:40:12,900 it's a little bit like the resolution at which you're, 783 00:40:12,900 --> 00:40:14,699 you know, looking at this image, 784 00:40:14,699 --> 00:40:18,100 and so usually there's kind of, you might want to match 785 00:40:18,100 --> 00:40:20,693 sort of your horizontal and vertical resolutions. 786 00:40:20,693 --> 00:40:22,886 But, yeah, so in practice you could 787 00:40:22,886 --> 00:40:25,553 but really people don't do that. 788 00:40:26,555 --> 00:40:28,373 Okay, another question. 789 00:40:28,373 --> 00:40:31,453 [faint speaking] 790 00:40:31,453 --> 00:40:33,710 So the question is, why do we do zero padding? 791 00:40:33,710 --> 00:40:35,247 So the reason we do zero padding 792 00:40:35,247 --> 00:40:39,376 is to maintain the same size as our input, as we had before. 793 00:40:39,376 --> 00:40:41,297 Right, so we started with seven by seven, 794 00:40:41,297 --> 00:40:44,182 and if we looked at just starting your filter 795 00:40:44,182 --> 00:40:46,756 from the upper left-hand corner, filling everything in, 796 00:40:46,756 --> 00:40:49,019 right, then we get a smaller size output, 797 00:40:49,019 --> 00:40:53,186 but we would like to maintain our full size output. 798 00:40:56,276 --> 00:40:57,109 Okay, so, 799 00:40:59,251 --> 00:41:02,664 yeah, so we saw how padding can basically help you 800 00:41:02,664 --> 00:41:05,527 maintain the size of the output that you want, 801 00:41:05,527 --> 00:41:08,237 as well as apply your filter at these, 802 00:41:08,237 --> 00:41:10,753 like, corner regions and edge regions. 803 00:41:10,753 --> 00:41:13,142 And so in general in terms of choosing, 804 00:41:13,142 --> 00:41:15,772 you know, your filter size, your stride size, 805 00:41:15,772 --> 00:41:18,998 your zero padding, what's common to see 806 00:41:18,998 --> 00:41:22,405 is filters of size three by three, five by five, 807 00:41:22,405 --> 00:41:25,427 seven by seven, these are pretty common filter sizes. 808 00:41:25,427 --> 00:41:27,908 And so each of these, for three by three 809 00:41:27,908 --> 00:41:30,232 you will want to zero pad with one 810 00:41:30,232 --> 00:41:33,567 in order to maintain the same spatial size. 811 00:41:33,567 --> 00:41:35,618 If you're going to do five by five, 812 00:41:35,618 --> 00:41:37,470 you can work out the math, but it's going to come out 813 00:41:37,470 --> 00:41:39,422 to you want to zero pad by two. 814 00:41:39,422 --> 00:41:43,505 And then for seven you want to zero pad by three. 815 00:41:44,722 --> 00:41:47,316 Okay, and so again you know, the motivation 816 00:41:47,316 --> 00:41:50,167 for doing this type of zero padding 817 00:41:50,167 --> 00:41:52,184 and trying to maintain the input size, right, 818 00:41:52,184 --> 00:41:54,500 so we kind of alluded to this before, 819 00:41:54,500 --> 00:41:58,667 but if you have multiple of these layers stacked together...
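As a quick check of those numbers, the zero padding that preserves spatial size at stride one works out to (F - 1) / 2 for an F by F filter; a minimal sketch, with the padded version of the output-size formula (function names are illustrative):

    def same_pad(F):
        # zero padding that preserves spatial size at stride 1 (odd F assumed)
        return (F - 1) // 2

    def output_size_padded(N, F, stride, pad):
        # output size with zero padding: (N + 2*pad - F) / stride + 1
        return (N + 2 * pad - F) // stride + 1

    for F in (3, 5, 7):
        print(F, same_pad(F))              # 3 -> 1, 5 -> 2, 7 -> 3
    print(output_size_padded(7, 3, 1, 1))  # 7: the padded example from before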
820 00:42:03,354 --> 00:42:07,015 So if you have multiple of these layers stacked together 821 00:42:07,015 --> 00:42:08,689 you'll see that, you know, if we don't do this kind of 822 00:42:08,689 --> 00:42:10,566 zero padding, or any kind of padding, 823 00:42:10,566 --> 00:42:12,848 we're going to really quickly shrink the size 824 00:42:12,848 --> 00:42:14,602 of the outputs that we have. 825 00:42:14,602 --> 00:42:16,616 Right, and so this is not something that we want. 826 00:42:16,616 --> 00:42:19,302 Like, you can imagine if you have a pretty deep network 827 00:42:19,302 --> 00:42:23,293 then very quickly your, the size of your activation maps 828 00:42:23,293 --> 00:42:25,907 is going to shrink to something very small. 829 00:42:25,907 --> 00:42:28,790 And this is bad both because we're kind of losing out 830 00:42:28,790 --> 00:42:29,990 on some of this information, right, 831 00:42:29,990 --> 00:42:34,272 now you're using a much smaller number of values 832 00:42:34,272 --> 00:42:36,578 in order to represent your original image, 833 00:42:36,578 --> 00:42:38,568 so you don't want that. 834 00:42:38,568 --> 00:42:41,318 And then at the same time also as 835 00:42:42,983 --> 00:42:46,249 we talked about this earlier, you're also kind of 836 00:42:46,249 --> 00:42:48,589 losing sort of some of this edge information, 837 00:42:48,589 --> 00:42:49,923 corner information, and each time 838 00:42:49,923 --> 00:42:53,590 we're losing out on that and shrinking it further. 839 00:42:55,203 --> 00:42:57,310 Okay, so let's go through a couple more examples 840 00:42:57,310 --> 00:43:00,060 of computing some of these sizes. 841 00:43:00,991 --> 00:43:03,018 So let's say that we have an input volume 842 00:43:03,018 --> 00:43:05,611 which is 32 by 32 by three. 843 00:43:05,611 --> 00:43:09,244 And here we have 10 five by five filters. 844 00:43:09,244 --> 00:43:12,388 Let's use stride one and pad two. 845 00:43:12,388 --> 00:43:13,550 And so who can tell me 846 00:43:13,550 --> 00:43:16,717 what's the output volume size of this? 847 00:43:18,188 --> 00:43:20,353 So you can think about the formula earlier. 848 00:43:20,353 --> 00:43:21,728 Sorry, what was it? 849 00:43:21,728 --> 00:43:23,263 [faint speaking] 850 00:43:23,263 --> 00:43:26,180 32 by 32 by 10, yes that's correct. 851 00:43:27,572 --> 00:43:30,324 And so the way we can see this, right, 852 00:43:30,324 --> 00:43:33,707 is so we have our input size, N, is 32. 853 00:43:33,707 --> 00:43:36,401 Then in this case we want to augment it 854 00:43:36,401 --> 00:43:38,396 by the padding that we added onto this. 855 00:43:38,396 --> 00:43:41,209 So we padded it two in each dimension, right, 856 00:43:41,209 --> 00:43:44,122 so we're actually going to get, total width and total height's 857 00:43:44,122 --> 00:43:47,181 going to be 32 plus two on each side, so 36. 858 00:43:47,181 --> 00:43:49,992 And then minus our filter size five, 859 00:43:49,992 --> 00:43:51,716 divided by one plus one and we get 32. 860 00:43:51,716 --> 00:43:55,883 So our output is going to be 32 by 32 for each filter. 861 00:43:57,213 --> 00:44:00,302 And then we have 10 filters total, 862 00:44:00,302 --> 00:44:02,193 so we have 10 of these activation maps, 863 00:44:02,193 --> 00:44:06,360 and our total output volume is going to be 32 by 32 by 10. 864 00:44:08,244 --> 00:44:10,040 Okay, next question, 865 00:44:10,040 --> 00:44:14,478 so what's the number of parameters in this layer? 866 00:44:14,478 --> 00:44:18,145 So remember we have 10 five by five filters.
867 00:44:19,769 --> 00:44:22,698 [faint speaking] 868 00:44:22,698 --> 00:44:26,365 I kind of heard something, but it was quiet. 869 00:44:29,407 --> 00:44:31,240 Can you guys speak up? 870 00:44:32,809 --> 00:44:36,226 250, okay so I heard 250, which is close, 871 00:44:37,829 --> 00:44:40,018 but remember that we're also, our input volume, 872 00:44:40,018 --> 00:44:42,149 each of these filters goes through by depth. 873 00:44:42,149 --> 00:44:44,237 So maybe this wasn't clearly written here 874 00:44:44,237 --> 00:44:46,855 because each of the filters is five by five spatially, 875 00:44:46,855 --> 00:44:50,300 but implicitly we also have the depth in here, right. 876 00:44:50,300 --> 00:44:52,835 It's going to go through the whole volume. 877 00:44:52,835 --> 00:44:55,876 So I heard, yeah, 750 I heard. 878 00:44:55,876 --> 00:44:57,430 Almost there, this is kind of a trick question 879 00:44:57,430 --> 00:44:59,445 'cause also remember we usually always have 880 00:44:59,445 --> 00:45:03,374 a bias term, right, so in practice each filter 881 00:45:03,374 --> 00:45:08,084 has five by five by three weights, plus our one bias term, 882 00:45:08,084 --> 00:45:10,483 we have 76 parameters per filter, 883 00:45:10,483 --> 00:45:12,609 and then we have 10 of these total, 884 00:45:12,609 --> 00:45:15,609 and so there's 760 total parameters. 885 00:45:18,412 --> 00:45:20,464 Okay, and so here's just a summary 886 00:45:20,464 --> 00:45:24,105 of the convolutional layer that you guys can read 887 00:45:24,105 --> 00:45:25,890 a little bit more carefully later on. 888 00:45:25,890 --> 00:45:28,924 But we have our input volume of a certain dimension, 889 00:45:28,924 --> 00:45:31,137 we have all of these choices, we have our filters, right, 890 00:45:31,137 --> 00:45:33,751 where we have number of filters, the filter size, 891 00:45:33,751 --> 00:45:36,170 the stride size, the amount of zero padding, 892 00:45:36,170 --> 00:45:38,682 and you basically can use all of these, 893 00:45:38,682 --> 00:45:41,167 go through the computations that we talked about earlier 894 00:45:41,167 --> 00:45:43,866 in order to find out what your output volume is actually 895 00:45:43,866 --> 00:45:48,033 going to be and how many total parameters that you have. 896 00:45:49,282 --> 00:45:51,951 And so some common settings of this. 897 00:45:51,951 --> 00:45:55,526 You know, we talked earlier about common filter sizes 898 00:45:55,526 --> 00:45:58,555 of three by three, five by five. 899 00:45:58,555 --> 00:46:01,739 Stride is usually one and two is pretty common. 900 00:46:01,739 --> 00:46:04,505 And then your padding P is going to be whatever fits, 901 00:46:04,505 --> 00:46:08,518 like, whatever will preserve your spatial extent 902 00:46:08,518 --> 00:46:10,401 is what's common. 903 00:46:10,401 --> 00:46:13,623 And then the total number of filters K, 904 00:46:13,623 --> 00:46:16,759 usually we use powers of two just to be nice, so, you know, 905 00:46:16,759 --> 00:46:19,009 32, 64, 128 and so on, 512, 906 00:46:19,903 --> 00:46:24,505 these are pretty common numbers that you'll see.
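Putting the worked example from just above into a short sketch (variable names are just illustrative), both the output volume and the parameter count fall out of the same few lines:

    N, F, stride, pad = 32, 5, 1, 2
    num_filters, depth = 10, 3

    out = (N + 2 * pad - F) // stride + 1   # (32 + 4 - 5) / 1 + 1 = 32
    print(out, out, num_filters)            # output volume: 32 x 32 x 10

    params_per_filter = F * F * depth + 1   # 5*5*3 weights plus 1 bias = 76
    print(params_per_filter * num_filters)  # 760 parameters total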
907 00:46:24,505 --> 00:46:26,511 And just as an aside, 908 00:46:26,511 --> 00:46:29,488 we can also do a one by one convolution, 909 00:46:29,488 --> 00:46:31,557 this still makes perfect sense where 910 00:46:31,557 --> 00:46:33,459 given a one by one convolution 911 00:46:33,459 --> 00:46:35,852 we still slide it over each spatial extent, 912 00:46:35,852 --> 00:46:37,700 but now, you know, the spatial region 913 00:46:37,700 --> 00:46:38,888 is not really five by five 914 00:46:38,888 --> 00:46:42,574 it's just kind of the trivial case of one by one, 915 00:46:42,574 --> 00:46:44,819 but we are still having this filter 916 00:46:44,819 --> 00:46:46,680 go through the entire depth. 917 00:46:46,680 --> 00:46:48,273 Right, so this is going to be a dot product 918 00:46:48,273 --> 00:46:52,053 through the entire depth of your input volume. 919 00:46:52,053 --> 00:46:55,067 And so the output here, right, if we have an input volume 920 00:46:55,067 --> 00:46:59,804 of 56 by 56 by 64 depth and we're going to do one by one 921 00:46:59,804 --> 00:47:03,895 convolution with 32 filters, then our output is going to be 922 00:47:03,895 --> 00:47:07,062 56 by 56 by our number of filters, 32. 923 00:47:10,076 --> 00:47:13,419 Okay, and so here's an example of a convolutional layer 924 00:47:13,419 --> 00:47:16,210 in Torch, a deep learning framework. 925 00:47:16,210 --> 00:47:18,747 And so you'll see that, you know, last lecture 926 00:47:18,747 --> 00:47:20,799 we talked about how you can go into these 927 00:47:20,799 --> 00:47:23,427 deep learning frameworks, you can see these definitions 928 00:47:23,427 --> 00:47:25,017 of each layer, right, where they have kind of 929 00:47:25,017 --> 00:47:26,665 the forward pass and the backward pass 930 00:47:26,665 --> 00:47:28,667 implemented for each layer. 931 00:47:28,667 --> 00:47:30,638 And so you'll see convolutions, 932 00:47:30,638 --> 00:47:33,562 spatial convolution is going to be just one of these, 933 00:47:33,562 --> 00:47:35,360 and then the arguments that it's going to take 934 00:47:35,360 --> 00:47:39,890 are going to be all of these design choices of, you know, 935 00:47:39,890 --> 00:47:42,781 I mean, I guess your input and output sizes, 936 00:47:42,781 --> 00:47:45,759 but also your choices of like your kernel width, 937 00:47:45,759 --> 00:47:50,161 your kernel size, padding, and these kinds of things. 938 00:47:50,161 --> 00:47:53,226 Right, and so if we look at another framework, Caffe, 939 00:47:53,226 --> 00:47:54,737 you'll see something very similar, 940 00:47:54,737 --> 00:47:56,950 where again now when you're defining your network 941 00:47:56,950 --> 00:48:00,880 you define networks in Caffe using this kind of, you know, 942 00:48:00,880 --> 00:48:03,982 prototxt file where you're specifying 943 00:48:03,982 --> 00:48:07,160 each of your design choices for your layer 944 00:48:07,160 --> 00:48:09,279 and you can see for a convolutional layer 945 00:48:09,279 --> 00:48:11,806 will say things like, you know, the number of outputs 946 00:48:11,806 --> 00:48:14,077 that we have, this is going to be the number of filters 947 00:48:14,077 --> 00:48:18,244 for Caffe, as well as the kernel size and stride and so on. 948 00:48:21,144 --> 00:48:24,701 Okay, and so I guess before I go on, 949 00:48:24,701 --> 00:48:26,512 any questions about convolution, 950 00:48:26,512 --> 00:48:29,512 how the convolution operation works? 951 00:48:30,868 --> 00:48:32,161 Yes, question.
952 00:48:32,161 --> 00:48:34,911 [faint speaking] 953 00:48:51,604 --> 00:48:52,940 Yeah, so the question is, 954 00:48:52,940 --> 00:48:55,902 what's the intuition behind how you choose your stride. 955 00:48:55,902 --> 00:49:00,037 And so in one sense it's kind of the resolution 956 00:49:00,037 --> 00:49:02,401 at which you slide it on, and usually the reason behind this 957 00:49:02,401 --> 00:49:04,870 is because when we have a larger stride 958 00:49:04,870 --> 00:49:06,908 what we end up getting as the output 959 00:49:06,908 --> 00:49:09,258 is a downsampled image, right, 960 00:49:09,258 --> 00:49:13,425 and so what this downsampled image lets us have is both, 961 00:49:14,715 --> 00:49:17,202 it's kind of like pooling, in a sense, 962 00:49:17,202 --> 00:49:19,352 but just a different, and sometimes better, 963 00:49:19,352 --> 00:49:23,025 way of doing pooling, is one of the intuitions behind this, 964 00:49:23,025 --> 00:49:27,192 'cause you get the same effect of downsampling your image, 965 00:49:28,183 --> 00:49:32,691 and then also as you're doing this you're reducing the size 966 00:49:32,691 --> 00:49:35,502 of the activation maps that you're dealing with 967 00:49:35,502 --> 00:49:38,892 at each layer, right, and so this also affects later on 968 00:49:38,892 --> 00:49:40,825 the total number of parameters that you have 969 00:49:40,825 --> 00:49:44,973 because for example at the end of all your Conv layers, 970 00:49:44,973 --> 00:49:48,611 now you might put on fully connected layers on top, 971 00:49:48,611 --> 00:49:51,092 for example, and now the fully connected layer's 972 00:49:51,092 --> 00:49:53,362 going to be connected to every value 973 00:49:53,362 --> 00:49:56,099 of your convolutional output, right, 974 00:49:56,099 --> 00:49:59,058 and so a smaller one will give you a smaller number 975 00:49:59,058 --> 00:50:02,596 of parameters, and so now you can get into, like, 976 00:50:02,596 --> 00:50:04,960 basically thinking about trade offs of, you know, 977 00:50:04,960 --> 00:50:08,025 number of parameters you have, the size of your model, 978 00:50:08,025 --> 00:50:10,076 overfitting, things like that, and so yeah, 979 00:50:10,076 --> 00:50:11,371 these are kind of some of the things 980 00:50:11,371 --> 00:50:15,538 that you want to think about with choosing your stride. 981 00:50:18,496 --> 00:50:22,421 Okay, so now if we look a little bit at kind of the, 982 00:50:22,421 --> 00:50:25,356 you know, brain neuron view of a convolutional layer, 983 00:50:25,356 --> 00:50:29,627 similar to what we looked at for the neurons 984 00:50:29,627 --> 00:50:31,599 in the last lecture. 985 00:50:31,599 --> 00:50:35,610 So what we have is that at every spatial location, 986 00:50:35,610 --> 00:50:37,488 we take a dot product between a filter 987 00:50:37,488 --> 00:50:39,216 and a specific part of the image, right, 988 00:50:39,216 --> 00:50:42,077 and we get one number out from here. 989 00:50:42,077 --> 00:50:43,506 And so this is the same idea 990 00:50:43,506 --> 00:50:46,042 of doing these types of dot products, right, 991 00:50:46,042 --> 00:50:49,270 taking your input, weighting it by these Ws, right, 992 00:50:49,270 --> 00:50:53,659 values of your filter, these weights that are the synapses, 993 00:50:53,659 --> 00:50:55,227 and getting a value out. 994 00:50:55,227 --> 00:50:57,559 But the main difference here is just that now 995 00:50:57,559 --> 00:50:59,517 your neuron has local connectivity.
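A quick NumPy sketch of that difference, assuming a 32 by 32 by 3 input: a fully connected neuron has one weight for every input value, while a convolutional neuron only has weights over its local five by five region:

    import numpy as np

    x = np.random.randn(32, 32, 3)

    # fully connected neuron: one weight for every input value
    w_fc = np.random.randn(32 * 32 * 3)
    out_fc = w_fc @ x.ravel()                   # one number from the whole input

    # convolutional neuron: weights only over a local 5x5x3 region
    w_conv = np.random.randn(5, 5, 3)
    out_conv = np.sum(w_conv * x[0:5, 0:5, :])  # one number from a local patch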
996 00:50:59,517 --> 00:51:02,191 So instead of being connected to the entire input, 997 00:51:02,191 --> 00:51:06,536 it's just looking at a local region spatially of your image. 998 00:51:06,536 --> 00:51:08,701 And so this looks at a local region 999 00:51:08,701 --> 00:51:11,859 and then now you're going to get kind of, you know, 1000 00:51:11,859 --> 00:51:15,111 this, how much this neuron is being triggered 1001 00:51:15,111 --> 00:51:17,500 at every spatial location in your image. 1002 00:51:17,500 --> 00:51:19,631 Right, so now you preserve the spatial structure 1003 00:51:19,631 --> 00:51:22,485 and you can say, you know, be able to reason 1004 00:51:22,485 --> 00:51:26,652 on top of these kinds of activation maps in later layers. 1005 00:51:30,048 --> 00:51:33,181 And just a little bit of terminology, 1006 00:51:33,181 --> 00:51:36,931 again for, you know, we have this five by five filter, 1007 00:51:36,931 --> 00:51:40,015 we can also call this a five by five receptive field 1008 00:51:40,015 --> 00:51:41,726 for the neuron, because 1009 00:51:41,726 --> 00:51:44,300 the receptive field is basically the, you know, 1010 00:51:44,300 --> 00:51:46,535 field of vision over the input 1011 00:51:46,535 --> 00:51:48,518 that this neuron is receiving, right, 1012 00:51:48,518 --> 00:51:51,758 and so that's just another common term 1013 00:51:51,758 --> 00:51:53,315 that you'll hear for this. 1014 00:51:53,315 --> 00:51:55,743 And then again remember each of these five by five filters 1015 00:51:55,743 --> 00:51:58,442 we're sliding them over the spatial locations 1016 00:51:58,442 --> 00:52:00,506 but they're the same set of weights, 1017 00:52:00,506 --> 00:52:03,089 they share the same parameters. 1018 00:52:05,440 --> 00:52:08,045 Okay, and so, you know, as we talked about 1019 00:52:08,045 --> 00:52:09,485 what we're going to get at this output 1020 00:52:09,485 --> 00:52:11,200 is going to be this volume, right, 1021 00:52:11,200 --> 00:52:13,874 where spatially we have, you know, let's say 28 by 28 1022 00:52:13,874 --> 00:52:16,373 and then our number of filters is the depth. 1023 00:52:16,373 --> 00:52:18,357 And so for example with five filters, 1024 00:52:18,357 --> 00:52:20,663 what we're going to get out is this 3D grid 1025 00:52:20,663 --> 00:52:23,381 that's 28 by 28 by five. 1026 00:52:23,381 --> 00:52:26,606 And so if you look across the filters 1027 00:52:26,606 --> 00:52:30,654 at one spatial location of the activation volume, 1028 00:52:30,654 --> 00:52:33,825 going through depth, these five neurons, 1029 00:52:33,825 --> 00:52:36,003 all of these neurons, 1030 00:52:36,003 --> 00:52:37,408 basically the way you can interpret this 1031 00:52:37,408 --> 00:52:39,471 is they're all looking at the same region 1032 00:52:39,471 --> 00:52:40,590 in the input volume, 1033 00:52:40,590 --> 00:52:42,344 but they're just looking for different things, right. 1034 00:52:42,344 --> 00:52:43,953 So they're different filters 1035 00:52:43,953 --> 00:52:48,120 applied to the same spatial location in the image. 1036 00:52:49,152 --> 00:52:52,391 And so just a reminder again kind of comparing 1037 00:52:52,391 --> 00:52:55,443 with the fully connected layer that we talked about earlier.
1038 00:52:55,443 --> 00:52:57,805 In that case, right, if we look at each of the neurons 1039 00:52:57,805 --> 00:53:01,607 in our activation or output, each of the neurons 1040 00:53:01,607 --> 00:53:03,983 was connected to the entire stretched out input, 1041 00:53:03,983 --> 00:53:06,637 so it looked at the entire full input volume, 1042 00:53:06,637 --> 00:53:08,802 compared to now where each one 1043 00:53:08,802 --> 00:53:12,805 just looks at this local spatial region. 1044 00:53:12,805 --> 00:53:14,255 Question. 1045 00:53:14,255 --> 00:53:17,088 [muffled talking] 1046 00:53:22,648 --> 00:53:25,054 Okay, so the question is, within a given layer, 1047 00:53:25,054 --> 00:53:28,137 are the filters completely symmetric? 1048 00:53:30,158 --> 00:53:34,325 So what do you mean by symmetric exactly, I guess? 1049 00:53:42,200 --> 00:53:46,389 Right, so okay, so the filters, are the filters doing, 1050 00:53:46,389 --> 00:53:50,556 they're doing the same dimension, the same calculation, yes. 1051 00:53:52,784 --> 00:53:54,444 Okay, so is there anything different 1052 00:53:54,444 --> 00:53:58,122 other than they have the same parameter values? 1053 00:53:58,122 --> 00:53:59,624 No, so you're exactly right, 1054 00:53:59,624 --> 00:54:02,690 we're just taking a filter with a given set of, you know, 1055 00:54:02,690 --> 00:54:04,973 five by five by three parameter values, 1056 00:54:04,973 --> 00:54:07,335 and we just slide this in exactly the same way 1057 00:54:07,335 --> 00:54:11,502 over the entire input volume to get an activation map. 1058 00:54:14,596 --> 00:54:17,668 Okay, so you know, we've gone into a lot of detail 1059 00:54:17,668 --> 00:54:20,592 in what these convolutional layers look like, 1060 00:54:20,592 --> 00:54:22,372 and so now I'm just going to go briefly 1061 00:54:22,372 --> 00:54:25,196 through the other layers that we have 1062 00:54:25,196 --> 00:54:28,802 that form this entire convolutional network. 1063 00:54:28,802 --> 00:54:31,071 Right, so remember again, we have convolutional layers 1064 00:54:31,071 --> 00:54:33,365 interspersed with pooling layers once in a while 1065 00:54:33,365 --> 00:54:36,653 as well as these non-linearities. 1066 00:54:36,653 --> 00:54:39,017 Okay, so what the pooling layers do 1067 00:54:39,017 --> 00:54:41,112 is that they make the representations 1068 00:54:41,112 --> 00:54:42,716 smaller and more manageable, right, 1069 00:54:42,716 --> 00:54:45,107 so we talked about this earlier when 1070 00:54:45,107 --> 00:54:48,683 someone asked a question about why we would want to make 1071 00:54:48,683 --> 00:54:51,562 the representation smaller. 1072 00:54:51,562 --> 00:54:54,919 And so this is again so that we have fewer parameters, 1073 00:54:54,919 --> 00:54:58,343 it affects the number of parameters that you have at the end 1074 00:54:58,343 --> 00:55:01,614 as well as basically does some, you know, 1075 00:55:01,614 --> 00:55:04,425 invariance over a given region. 1076 00:55:04,425 --> 00:55:05,830 And so what the pooling layer does 1077 00:55:05,830 --> 00:55:09,460 is exactly that, it just downsamples, 1078 00:55:09,460 --> 00:55:13,415 and it takes your input volume, so for example, 1079 00:55:13,415 --> 00:55:17,762 224 by 224 by 64, and spatially downsamples this. 1080 00:55:17,762 --> 00:55:20,861 So in the end you'll get out 112 by 112. 1081 00:55:20,861 --> 00:55:23,429 And it's important to note this doesn't do anything 1082 00:55:23,429 --> 00:55:26,588 in the depth, right, we're only pooling spatially.
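A minimal NumPy sketch of that, using the two by two max pooling described next; each spatial axis is halved while the depth axis is untouched:

    import numpy as np

    x = np.random.randn(224, 224, 64)   # input volume

    # 2x2 max pooling with stride 2: max over each non-overlapping 2x2 block
    pooled = x.reshape(112, 2, 112, 2, 64).max(axis=(1, 3))
    print(pooled.shape)                 # (112, 112, 64), depth unchanged

Swapping .max for .mean here would give average pooling, which comes up in a question below.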
1083 00:55:26,588 --> 00:55:30,168 So your input depth 1084 00:55:30,168 --> 00:55:33,215 is going to be the same as your output depth. 1085 00:55:33,215 --> 00:55:36,948 And so, for example, a common way to do this is max pooling. 1086 00:55:36,948 --> 00:55:41,317 So in this case our pooling layer also has a filter size 1087 00:55:41,317 --> 00:55:44,289 and this filter size is going to be the region 1088 00:55:44,289 --> 00:55:46,825 that we pool over, right, so in this case 1089 00:55:46,825 --> 00:55:50,562 if we have two by two filters, we're going to slide this, 1090 00:55:50,562 --> 00:55:53,572 and so, here, we also have stride two in this case, 1091 00:55:53,572 --> 00:55:54,884 so we're going to take this filter 1092 00:55:54,884 --> 00:55:58,999 and we're going to slide it along our input volume 1093 00:55:58,999 --> 00:56:01,672 in exactly the same way as we did for convolution. 1094 00:56:01,672 --> 00:56:03,619 But here instead of doing these dot products, 1095 00:56:03,619 --> 00:56:06,205 we just take the maximum value 1096 00:56:06,205 --> 00:56:08,338 of the input volume in that region. 1097 00:56:08,338 --> 00:56:11,645 Right, so here if we look at the red values, 1098 00:56:11,645 --> 00:56:13,416 the value we take will be six, since it's the largest. 1099 00:56:13,416 --> 00:56:15,655 If we look at the greens it's going to give an eight, 1100 00:56:15,655 --> 00:56:18,655 and then we have a three and a four. 1101 00:56:23,433 --> 00:56:24,931 Yes, question. 1102 00:56:24,931 --> 00:56:27,848 [muffled speaking] 1103 00:56:29,010 --> 00:56:31,304 Yeah, so the question is, is it typical to set up the stride 1104 00:56:31,304 --> 00:56:34,406 so that there isn't an overlap? 1105 00:56:34,406 --> 00:56:36,850 And yeah, so for the pooling layers it is, 1106 00:56:36,850 --> 00:56:38,196 I think the more common thing to do 1107 00:56:38,196 --> 00:56:41,256 is to have them not have any overlap, 1108 00:56:41,256 --> 00:56:44,688 and I guess the way you can think about this 1109 00:56:44,688 --> 00:56:48,322 is basically we just want to downsample 1110 00:56:48,322 --> 00:56:50,560 and so it makes sense to kind of look at this region 1111 00:56:50,560 --> 00:56:52,977 and just get one value to represent this region 1112 00:56:52,977 --> 00:56:55,874 and then just look at the next region and so on. 1113 00:56:55,874 --> 00:56:57,379 Yeah, question. 1114 00:56:57,379 --> 00:57:00,129 [faint speaking] 1115 00:57:02,415 --> 00:57:04,328 Okay, so the question is, why is max pooling 1116 00:57:04,328 --> 00:57:05,710 better than just taking the, 1117 00:57:05,710 --> 00:57:07,636 doing something like average pooling? 1118 00:57:07,636 --> 00:57:10,058 Yes, that's a good point, like, average pooling 1119 00:57:10,058 --> 00:57:12,017 is also something that you can do, 1120 00:57:12,017 --> 00:57:15,417 and the intuition behind why max pooling is commonly used 1121 00:57:15,417 --> 00:57:17,979 is that it can have this interpretation of, 1122 00:57:17,979 --> 00:57:21,471 you know, if this is, these are activations of my neurons, 1123 00:57:21,471 --> 00:57:23,770 right, and so each value is kind of 1124 00:57:23,770 --> 00:57:26,972 how much this neuron fired in this location, 1125 00:57:26,972 --> 00:57:29,253 how much this filter fired in this location.
1126 00:57:29,253 --> 00:57:31,927 And so you can think of max pooling as saying, 1127 00:57:31,927 --> 00:57:36,094 you know, giving a signal of how much did this filter fire 1128 00:57:37,000 --> 00:57:39,133 at any location in this image. 1129 00:57:39,133 --> 00:57:41,264 Right, and if we're thinking about detecting, 1130 00:57:41,264 --> 00:57:44,022 you know, doing recognition, 1131 00:57:44,022 --> 00:57:46,535 this might make some intuitive sense where you're saying, 1132 00:57:46,535 --> 00:57:49,034 well, you know, whether a light or whether some aspect 1133 00:57:49,034 --> 00:57:52,206 of your image that you're looking for, 1134 00:57:52,206 --> 00:57:53,990 whether it happens anywhere in this region, 1135 00:57:53,990 --> 00:57:57,073 we want it to fire with a high value. 1136 00:57:57,940 --> 00:57:59,129 Question. 1137 00:57:59,129 --> 00:58:02,046 [muffled speaking] 1138 00:58:06,200 --> 00:58:08,746 Yeah, so the question is, since pooling and stride 1139 00:58:08,746 --> 00:58:10,959 both have the same effect of downsampling, 1140 00:58:10,959 --> 00:58:14,223 can you just use stride instead of pooling and so on? 1141 00:58:14,223 --> 00:58:16,513 Yeah, and so in practice I think 1142 00:58:16,513 --> 00:58:19,771 looking at more recent neural network architectures 1143 00:58:19,771 --> 00:58:23,103 people have begun to use stride more 1144 00:58:23,103 --> 00:58:27,704 in order to do the downsampling instead of just pooling. 1145 00:58:27,704 --> 00:58:30,837 And I think this gets into things like, you know, 1146 00:58:30,837 --> 00:58:32,801 also like fractional strides and things that you can do. 1147 00:58:32,801 --> 00:58:36,968 But in practice this, in a sense, may be a slightly 1148 00:58:38,721 --> 00:58:41,892 better way to get good results, so. 1149 00:58:41,892 --> 00:58:44,125 Yeah, so I think using stride is definitely, 1150 00:58:44,125 --> 00:58:47,292 you can do it and people are doing it. 1151 00:58:49,672 --> 00:58:52,505 Okay, so let's see, where were we. 1152 00:58:53,544 --> 00:58:56,553 Okay, so yeah, so with these pooling layers, 1153 00:58:56,553 --> 00:59:00,358 so again, there are, right, some design choices that you make, 1154 00:59:00,358 --> 00:59:04,057 you take this input volume of W by H by D, 1155 00:59:04,057 --> 00:59:07,446 and then you're going to set your hyperparameters 1156 00:59:07,446 --> 00:59:10,107 for design choices of your filter size 1157 00:59:10,107 --> 00:59:12,376 or the spatial extent over which you are pooling, 1158 00:59:12,376 --> 00:59:15,101 as well as your stride, and then you can again compute 1159 00:59:15,101 --> 00:59:18,676 your output volume using the same equation that you used 1160 00:59:18,676 --> 00:59:21,325 earlier for convolution, it still applies here, right, 1161 00:59:21,325 --> 00:59:24,030 so we still have our W total extent 1162 00:59:24,030 --> 00:59:27,780 minus filter size divided by stride plus one. 1163 00:59:30,880 --> 00:59:33,217 Okay, and so just one other thing to note, 1164 00:59:33,217 --> 00:59:37,172 it's also, typically people don't really use zero padding 1165 00:59:37,172 --> 00:59:39,647 for the pooling layers because you're just trying 1166 00:59:39,647 --> 00:59:41,262 to do a direct downsampling, right, 1167 00:59:41,262 --> 00:59:43,003 so there isn't this problem of like, 1168 00:59:43,003 --> 00:59:44,423 applying a filter at the corner 1169 00:59:44,423 --> 00:59:47,045 and having some part of the filter go off your input volume.
1170 00:59:47,045 --> 00:59:49,526 And so for pooling we don't usually have to worry about this 1171 00:59:49,526 --> 00:59:52,939 and we just directly downsample. 1172 00:59:52,939 --> 00:59:56,304 And so some common settings for the pooling layer 1173 00:59:56,304 --> 01:00:00,890 are a filter size of two by two, or three by three, with a stride of two. 1174 01:00:00,890 --> 01:00:03,609 You know, you can have, 1175 01:00:03,609 --> 01:00:06,269 you can still have a stride of two by two 1176 01:00:06,269 --> 01:00:09,091 even with a filter size of three by three, 1177 01:00:09,091 --> 01:00:10,789 I think someone asked that earlier, 1178 01:00:10,789 --> 01:00:14,956 but in practice it's pretty common just to have two by two. 1179 01:00:17,958 --> 01:00:21,527 Okay, so now we've talked about these convolutional layers, 1180 01:00:21,527 --> 01:00:24,370 the ReLU layers were the same as what we had before 1181 01:00:24,370 --> 01:00:29,174 with the, you know, just the base neural network 1182 01:00:29,174 --> 01:00:31,492 that we talked about last lecture. 1183 01:00:31,492 --> 01:00:33,899 So we intersperse these and then we have a pooling layer 1184 01:00:33,899 --> 01:00:37,865 every once in a while when we feel like downsampling, right. 1185 01:00:37,865 --> 01:00:41,080 And then the last thing is that at the end 1186 01:00:41,080 --> 01:00:43,766 we want to have a fully connected layer. 1187 01:00:43,766 --> 01:00:46,210 And so this will be just exactly the same 1188 01:00:46,210 --> 01:00:48,790 as the fully connected layers that you've seen before. 1189 01:00:48,790 --> 01:00:50,506 So in this case now what we do 1190 01:00:50,506 --> 01:00:54,173 is we take the convolutional network output, 1191 01:00:55,775 --> 01:00:57,503 at the last layer we have some volume, 1192 01:00:57,503 --> 01:01:00,421 so we're going to have width by height by some depth, 1193 01:01:00,421 --> 01:01:01,626 and we just take all of these 1194 01:01:01,626 --> 01:01:04,212 and we essentially just stretch these out, right. 1195 01:01:04,212 --> 01:01:06,322 And so now we're going to get the same kind of, 1196 01:01:06,322 --> 01:01:08,795 you know, basically 1D input that we're used to 1197 01:01:08,795 --> 01:01:12,962 for a vanilla neural network, and then we're going to apply 1198 01:01:14,153 --> 01:01:16,275 this fully connected layer on top, 1199 01:01:16,275 --> 01:01:17,715 so now we're going to have connections 1200 01:01:17,715 --> 01:01:21,715 to every one of these convolutional map outputs. 1201 01:01:22,676 --> 01:01:24,786 And so the way you can think of this is basically, 1202 01:01:24,786 --> 01:01:26,457 now instead of preserving, you know, 1203 01:01:26,457 --> 01:01:28,616 before we were preserving spatial structure, 1204 01:01:28,616 --> 01:01:30,897 right, but at the last layer at the end, 1205 01:01:30,897 --> 01:01:32,982 we want to aggregate all of this together 1206 01:01:32,982 --> 01:01:34,787 and we want to reason basically on top of 1207 01:01:34,787 --> 01:01:37,081 all of this as we had before. 1208 01:01:37,081 --> 01:01:40,518 And so what you get from that is just our 1209 01:01:40,518 --> 01:01:43,185 score outputs as we had earlier. 1210 01:01:45,744 --> 01:01:47,232 Okay, so-- 1211 01:01:47,232 --> 01:01:48,411 - [Student] This is sort of a silly question 1212 01:01:48,411 --> 01:01:49,911 about this visual.
1213 01:01:52,345 --> 01:01:56,123 Like what are the 16 pixels that are on the far right, 1214 01:01:56,123 --> 01:02:00,357 like what should we be interpreting those as? 1215 01:02:00,357 --> 01:02:02,584 - Okay, so the question is, what are the 16 pixels 1216 01:02:02,584 --> 01:02:04,238 that are on the far right, do you mean the-- 1217 01:02:04,238 --> 01:02:05,888 - [Student] Like that column of-- 1218 01:02:05,888 --> 01:02:07,566 - [Instructor] Oh, each column. 1219 01:02:07,566 --> 01:02:09,425 - [Student] The column on the far right, yeah. 1220 01:02:09,425 --> 01:02:11,031 - [Instructor] The green ones or the black ones? 1221 01:02:11,031 --> 01:02:12,679 - [Student] The ones labeled pool. 1222 01:02:12,679 --> 01:02:14,472 - The one with, hold on, pool. 1223 01:02:14,472 --> 01:02:16,312 Oh, okay, yeah, so the question is 1224 01:02:16,312 --> 01:02:20,566 how do we interpret this column, right, for example at pool. 1225 01:02:20,566 --> 01:02:24,645 And so what we're showing here is each of these columns 1226 01:02:24,645 --> 01:02:28,376 is the output activation maps, right, 1227 01:02:28,376 --> 01:02:29,887 the output from one of these layers. 1228 01:02:29,887 --> 01:02:34,028 And so starting from the beginning, we have our car, 1229 01:02:34,028 --> 01:02:35,465 after the convolutional layer 1230 01:02:35,465 --> 01:02:37,795 we now have these activation maps of each of the filters 1231 01:02:37,795 --> 01:02:40,537 slid spatially over the input image. 1232 01:02:40,537 --> 01:02:42,484 Then we pass that through a ReLU, 1233 01:02:42,484 --> 01:02:45,306 so you can see the values coming out from there. 1234 01:02:45,306 --> 01:02:46,636 And then going all the way over, 1235 01:02:46,636 --> 01:02:48,652 and so what you get for the pooling layer 1236 01:02:48,652 --> 01:02:51,850 is that it's really just taking 1237 01:02:51,850 --> 01:02:54,183 the output of the ReLU layer 1238 01:02:55,548 --> 01:02:58,270 that came just before it and then it's pooling it. 1239 01:02:58,270 --> 01:03:00,337 So it's going to downsample it, 1240 01:03:00,337 --> 01:03:01,711 right, and then it's going to take 1241 01:03:01,711 --> 01:03:04,510 the max value in each filter location. 1242 01:03:04,510 --> 01:03:06,548 And so now if you look at this pool layer output, 1243 01:03:06,548 --> 01:03:09,209 like, for example, the last one that you were mentioning, 1244 01:03:09,209 --> 01:03:11,704 it looks the same as this ReLU output 1245 01:03:11,704 --> 01:03:15,871 except that it's downsampled and that it has this kind of 1246 01:03:17,311 --> 01:03:18,952 max value at every spatial location 1247 01:03:18,952 --> 01:03:20,550 and so that's the minor difference 1248 01:03:20,550 --> 01:03:22,534 that you'll see between those two. 1249 01:03:22,534 --> 01:03:25,451 [distant speaking] 1250 01:03:30,523 --> 01:03:32,559 So the question is, now this looks like 1251 01:03:32,559 --> 01:03:34,654 just a very small amount of information, right, 1252 01:03:34,654 --> 01:03:36,991 so how can it know to classify it from here? 1253 01:03:36,991 --> 01:03:39,553 And so the way that you should think about this 1254 01:03:39,553 --> 01:03:41,886 is that each of these values 1255 01:03:43,365 --> 01:03:46,052 inside one of these pool outputs is actually, 1256 01:03:46,052 --> 01:03:49,004 it's the accumulation of all the processing that you've done 1257 01:03:49,004 --> 01:03:50,696 throughout this entire network, right.
1258 01:03:50,696 --> 01:03:53,890 So it's at the very top of your hierarchy, 1259 01:03:53,890 --> 01:03:55,458 and so each actually represents 1260 01:03:55,458 --> 01:03:57,602 kind of a higher level concept. 1261 01:03:57,602 --> 01:04:01,197 So we saw before, you know, for example, Hubel and Wiesel 1262 01:04:01,197 --> 01:04:03,571 and building up these hierarchical filters, 1263 01:04:03,571 --> 01:04:07,466 where at the bottom level we're looking for edges, right, 1264 01:04:07,466 --> 01:04:10,257 or things like very simple structures, like edges. 1265 01:04:10,257 --> 01:04:13,872 And so after your convolutional layer 1266 01:04:13,872 --> 01:04:15,991 the outputs that you see here in this first column 1267 01:04:15,991 --> 01:04:20,541 is basically how much do specific, for example, edges, 1268 01:04:20,541 --> 01:04:22,700 fire at different locations in the image. 1269 01:04:22,700 --> 01:04:25,268 But then as you go through you're going to get more complex, 1270 01:04:25,268 --> 01:04:26,915 it's looking for more complex things, right, 1271 01:04:26,915 --> 01:04:28,955 and so the next convolutional layer 1272 01:04:28,955 --> 01:04:31,205 is going to fire at how much, you know, 1273 01:04:31,205 --> 01:04:34,674 let's say certain kinds of corners show up in the image, 1274 01:04:34,674 --> 01:04:36,080 right, because it's reasoning. 1275 01:04:36,080 --> 01:04:37,957 Its input is not the original image, 1276 01:04:37,957 --> 01:04:42,627 its input is the output, it's already the edge maps, right, 1277 01:04:42,627 --> 01:04:44,560 so it's reasoning on top of edge maps, 1278 01:04:44,560 --> 01:04:47,680 and so that allows it to get more complex, 1279 01:04:47,680 --> 01:04:49,052 detect more complex things. 1280 01:04:49,052 --> 01:04:50,756 And so by the time you get all the way up 1281 01:04:50,756 --> 01:04:53,212 to this last pooling layer, each value is representing 1282 01:04:53,212 --> 01:04:57,379 how much a relatively complex sort of template is firing. 1283 01:04:58,765 --> 01:05:01,613 Right, and so because of that now you can just have 1284 01:05:01,613 --> 01:05:04,460 a fully connected layer, you're just aggregating 1285 01:05:04,460 --> 01:05:07,228 all of this information together to get, 1286 01:05:07,228 --> 01:05:10,511 you know, a score for your class. 1287 01:05:10,511 --> 01:05:13,134 So each of these values is how much 1288 01:05:13,134 --> 01:05:17,051 a pretty complicated complex concept is firing. 1289 01:05:19,043 --> 01:05:20,460 Question. 1290 01:05:20,460 --> 01:05:23,239 [faint speaking] 1291 01:05:23,239 --> 01:05:24,744 So the question is, when do you know you've done 1292 01:05:24,744 --> 01:05:27,296 enough pooling to do the classification? 1293 01:05:27,296 --> 01:05:30,722 And the answer is you just try and see. 1294 01:05:30,722 --> 01:05:34,639 So in practice, you know, these are all design choices 1295 01:05:34,639 --> 01:05:37,430 and you can think about this a little bit intuitively, 1296 01:05:37,430 --> 01:05:41,203 right, like you want to pool but if you pool too much 1297 01:05:41,203 --> 01:05:43,585 you're going to have very few values 1298 01:05:43,585 --> 01:05:45,960 representing your entire image and so on, 1299 01:05:45,960 --> 01:05:47,701 so it's just kind of a trade off. 
1300 01:05:47,701 --> 01:05:50,581 You pick something reasonable, and people have tried 1301 01:05:50,581 --> 01:05:52,290 a lot of different configurations 1302 01:05:52,290 --> 01:05:54,614 so you'll probably cross validate, right, 1303 01:05:54,614 --> 01:05:57,049 and try over different pooling sizes, 1304 01:05:57,049 --> 01:05:59,492 different filter sizes, different number of layers, 1305 01:05:59,492 --> 01:06:02,926 and see what works best for your problem because yeah, 1306 01:06:02,926 --> 01:06:05,350 like for every problem, with different data, 1307 01:06:05,350 --> 01:06:07,423 you know, a different set of these sorts 1308 01:06:07,423 --> 01:06:10,340 of hyperparameters might work best. 1309 01:06:13,388 --> 01:06:16,836 Okay, so last thing, just wanted to point you guys 1310 01:06:16,836 --> 01:06:19,753 to this demo of training a ConvNet, 1311 01:06:21,171 --> 01:06:24,143 which was created by Andrej Karpathy, 1312 01:06:24,143 --> 01:06:26,424 the originator of this class. 1313 01:06:26,424 --> 01:06:28,755 And so he wrote up this demo 1314 01:06:28,755 --> 01:06:33,000 where you can basically train a ConvNet on CIFAR-10, 1315 01:06:33,000 --> 01:06:35,874 the dataset that we've seen before, right, with 10 classes. 1316 01:06:35,874 --> 01:06:39,341 And what's nice about this demo is you can, 1317 01:06:39,341 --> 01:06:42,014 it basically plots for you what each of these filters 1318 01:06:42,014 --> 01:06:44,260 look like, what the activation maps look like. 1319 01:06:44,260 --> 01:06:46,137 So some of the images I showed earlier 1320 01:06:46,137 --> 01:06:47,835 were taken from this demo. 1321 01:06:47,835 --> 01:06:50,048 And so you can go try it out, play around with it, 1322 01:06:50,048 --> 01:06:52,640 and you know, just go through and try and get a sense 1323 01:06:52,640 --> 01:06:55,268 for what these activation maps look like. 1324 01:06:55,268 --> 01:06:57,134 And just one thing to note, 1325 01:06:57,134 --> 01:07:00,578 usually the first layer activation maps are, 1326 01:07:00,578 --> 01:07:01,709 you can interpret them, right, 1327 01:07:01,709 --> 01:07:03,606 because they're operating directly on the input image 1328 01:07:03,606 --> 01:07:05,532 so you can see what these templates mean. 1329 01:07:05,532 --> 01:07:07,784 As you get to higher level layers 1330 01:07:07,784 --> 01:07:08,975 it starts getting really hard, 1331 01:07:08,975 --> 01:07:11,163 like how do you actually interpret what these mean. 1332 01:07:11,163 --> 01:07:13,877 So for the most part it's just hard to interpret 1333 01:07:13,877 --> 01:07:15,398 so you shouldn't, you know, don't worry 1334 01:07:15,398 --> 01:07:17,535 if you can't really make sense of what's going on. 1335 01:07:17,535 --> 01:07:19,604 But it's still nice just to see the entire flow 1336 01:07:19,604 --> 01:07:22,271 and what outputs are coming out. 1337 01:07:23,985 --> 01:07:27,313 Okay, so in summary, so today we talked about 1338 01:07:27,313 --> 01:07:29,946 how convolutional neural networks work, 1339 01:07:29,946 --> 01:07:31,257 how they're basically stacks 1340 01:07:31,257 --> 01:07:34,204 of these convolutional and pooling layers 1341 01:07:34,204 --> 01:07:38,291 followed by fully connected layers at the end. 1342 01:07:38,291 --> 01:07:40,940 There's been a trend towards having smaller filters 1343 01:07:40,940 --> 01:07:44,069 and deeper architectures, so we'll talk more 1344 01:07:44,069 --> 01:07:47,364 about case studies for some of these later on.
1345 01:07:47,364 --> 01:07:49,576 There's also been a trend towards getting rid of these 1346 01:07:49,576 --> 01:07:52,215 pooling and fully connected layers entirely. 1347 01:07:52,215 --> 01:07:55,275 So just keeping these, just having, you know, Conv layers, 1348 01:07:55,275 --> 01:07:57,391 very deep networks of Conv layers, 1349 01:07:57,391 --> 01:08:01,058 so again we'll discuss all of this later on. 1350 01:08:01,898 --> 01:08:04,591 And then typical architectures again look like this, 1351 01:08:04,591 --> 01:08:06,300 you know, as we had earlier. 1352 01:08:06,300 --> 01:08:08,964 Conv, ReLU for some N number of steps 1353 01:08:08,964 --> 01:08:10,821 followed by a pool every once in a while, 1354 01:08:10,821 --> 01:08:13,197 this whole thing repeated some number of times, 1355 01:08:13,197 --> 01:08:16,314 and then followed by fully connected ReLU layers 1356 01:08:16,314 --> 01:08:18,987 that we saw earlier, you know, one or two 1357 01:08:18,987 --> 01:08:20,287 or just a few of these, 1358 01:08:20,287 --> 01:08:24,060 and then a softmax at the end for your class scores. 1359 01:08:24,060 --> 01:08:26,100 And so, you know, some typical values 1360 01:08:26,100 --> 01:08:29,183 you might have N up to five of these. 1361 01:08:30,408 --> 01:08:33,144 You're going to have pretty deep layers 1362 01:08:33,145 --> 01:08:36,759 of Conv, ReLU, pool sequences, and then usually 1363 01:08:36,759 --> 01:08:39,701 just a couple of these fully connected layers at the end. 1364 01:08:39,701 --> 01:08:42,221 But we'll also go into some newer architectures 1365 01:08:42,221 --> 01:08:45,895 like ResNet and GoogLeNet, which challenge this 1366 01:08:45,895 --> 01:08:49,755 and will give pretty different types of architectures. 1367 01:08:49,756 --> 00:00:00,000 Okay, thank you and see you guys next time.
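To recap the typical architecture pattern from the summary above, here is a minimal sketch that just lays out the layer sequence for given N, M, and K; it is purely illustrative, not a training implementation:

    def convnet_layers(N=3, M=2, K=2):
        # [(CONV -> RELU) * N -> POOL] * M -> (FC -> RELU) * K -> SOFTMAX,
        # where the pool layer only appears every once in a while
        layers = []
        for _ in range(M):
            layers += ["CONV", "RELU"] * N + ["POOL"]
        layers += ["FC", "RELU"] * K + ["SOFTMAX"]
        return layers

    print(convnet_layers())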